Abstract

Recent work has shown that simple vector subtraction over word embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.

1 Introduction 

Learning to identify lexical relations is a fundamental task in natural language processing (“NLP”), and can contribute to many NLP applications including paraphrasing and generation, machine translation, and ontology building (Banko et al., 2007; Hendrickx et al., 2010).

Recently, attention has been focused on identifying lexical relations using word embeddings, which are dense, low-dimensional vectors obtained either from a “predict-based” neural network trained to predict word contexts, or a “count-based” traditional distributional similarity method combined with dimensionality reduction. The skip-gram model of Mikolov et al. (2013a) and other similar language models have been shown to perform well on an analogy completion task (Mikolov et al., 2013b; Mikolov et al., 2013c; Levy and Goldberg, 2014a), in the space of relational similarity prediction (Turney, 2006), where the task is to predict the missing word in analogies such as A:B :: C:?. A well-known example involves predicting the vector queen from the vector combination king − man + woman, where linear operations on word vectors appear to capture the lexical relation governing the analogy, in this case OPPOSITE-GENDER. The results extend to several semantic relations such as CAPITAL-OF (paris − france + poland ≈ warsaw) and morphosyntactic relations such as PLURALISATION (cars − car + apple ≈ apples). Remarkably, since the model is not trained for this task, the relational structure of the vector space appears to be an emergent property.

The key operation in these models is vector difference, or vector offset. For example, the paris − france vector appears to encode CAPITAL-OF, presumably by cancelling out the features of paris that are France-specific, and retaining the features that distinguish a capital city (Levy and Goldberg, 2014a). The success of the simple offset method on analogy completion suggests that the difference vectors (“DiffVec” hereafter) must themselves be meaningful: their direction and/or magnitude encodes a lexical relation.
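The offset method can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration (they are not taken from any of the embedding models discussed here), and the helper names `diffvec` and `complete_analogy` are hypothetical; the sketch only shows how a DiffVec is formed and how an analogy is completed by cosine similarity.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy embeddings; the axes loosely stand for
# "royalty", "male" and "female". Real embeddings have hundreds of dimensions.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def diffvec(w1, w2):
    """Offset vector w2 - w1 for the ordered word pair (w1, w2)."""
    return sub(EMB[w2], EMB[w1])

def complete_analogy(a, b, c):
    """Return the word d that maximises cos(d, b - a + c), excluding a, b and c."""
    target = add(sub(EMB[b], EMB[a]), EMB[c])
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMB[w], target))
```

In this toy space, the offsets for (man, woman) and (king, queen) point in the same direction, which is the property the analogy task relies on.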

Previous analogy completion tasks used with word embeddings have limited coverage of lexical relation types. Moreover, the task does not explore the full implications of DiffVecs as meaningful vector space objects in their own right, because it only looks for a one-best answer to the particular lexical analogies in the test set. In this paper, we introduce a new, larger dataset covering many well-known lexical relation types from the linguistics and cognitive science literature. We then apply DiffVecs to two new tasks: unsupervised and supervised relation extraction. First, we cluster the DiffVecs to test whether the clusters map onto true lexical relations. We find that the clustering works remarkably well, although syntactic relations are captured better than semantic ones.

Second, we perform classification over the DiffVecs and obtain remarkably high accuracy in a closed-world setting (over a predefined set of word pairs, each of which corresponds to a lexical relation in the training data). When we move to an open-world setting including random word pairs, many of which do not correspond to any lexical relation in the training data, the results are poor. We then investigate methods for better attuning the learned class representation to the lexical relations, focusing on methods for automatically synthesising negative instances. We find that this improves the model performance substantially.

We also find that hyper-parameter optimised count-based methods are competitive with predict-based methods under both clustering and supervised relation classification, in line with the findings of Levy et al. (2015a).

2 Background and Related Work 

A lexical relation is a binary relation $r$ holding between a word pair $(w_i, w_j)$; for example, the pair (cart, wheel) stands in the WHOLE-PART relation. Relation learning in NLP includes relation extraction, relation classification, and relational similarity prediction. In relation extraction, related word pairs in a corpus and the relevant relation are identified. Given a word pair, the relation classification task involves assigning the pair to the correct relation from a pre-defined set. In the Open Information Extraction paradigm (Banko et al., 2007; Weikum and Theobald, 2010), also known as unsupervised relation extraction, the relations themselves are also learned from the text (e.g. in the form of text labels). On the other hand, relational similarity prediction involves assessing the degree to which a word pair (A, B) stands in the same relation as another pair (C, D), or completing an analogy A:B :: C:?. Relation learning is an important and long-standing task in NLP and has been the focus of a number of shared tasks (Girju et al., 2007; Hendrickx et al., 2010; Jurgens et al., 2012).

Recently, attention has turned to using vector space models of words for relation classification and relational similarity prediction. Distributional word vectors have been used for detection of relations such as hypernymy (Geffet and Dagan, 2005; Kotlerman et al., 2010; Lenci and Benotto, 2012; Weeds et al., 2014; Rimell, 2014; Santus et al., 2014) and qualia structure (Yamada et al., 2009).


An exciting development, and the inspiration for this paper, has been the demonstration that vector difference over word embeddings (Mikolov et al., 2013c) can be used to model word analogy tasks. This has given rise to a series of papers exploring the DiffVec idea in different contexts. The original analogy dataset has been used to evaluate predict-based language models by Mnih and Kavukcuoglu (2013) and also Zhila et al. (2013), who combine a neural language model with a pattern-based classifier. Kim and de Marneffe (2013) use word embeddings to derive representations of adjective scales, e.g. hot – warm – cool – cold. Fu et al. (2014) similarly use embeddings to predict hypernym relations, in this case clustering words by topic to show that hypernym DiffVecs can be broken down into more fine-grained relations. Neural networks have also been developed for joint learning of lexical and relational similarity, making use of the WordNet relation hierarchy (Bordes et al., 2013; Socher et al., 2013; Xu et al., 2014; Yu and Dredze, 2014; Faruqui et al., 2015; Fried and Duh, 2015).

Another strand of work responding to the vector difference approach has analysed the structure of predict-based embedding models in order to help explain their success on the analogy and other tasks (Levy and Goldberg, 2014a; Levy and Goldberg, 2014b; Arora et al., 2015). However, there has been no systematic investigation of the range of relations for which the vector difference method is most effective, although there have been some smaller-scale investigations in this direction. Makrai et al. (2013) divide antonym pairs into semantic classes such as quality, time, gender, and distance, finding that for about two-thirds of antonym classes, DiffVecs are significantly more correlated than random. Necşulescu et al. (2015) train a classifier on word pairs, using word embeddings to predict coordinates, hypernyms, and meronyms. Roller and Erk (2016) analyse the performance of vector concatenation and difference on the task of predicting lexical entailment and show that vector concatenation overwhelmingly learns to detect Hearst patterns (e.g., including, such as). Köper et al. (2015) undertake a systematic study of morphosyntactic and semantic relations on word embeddings produced with word2vec (“w2v” hereafter; see §3.1) for English and German. They test a variety of relations including word similarity, antonyms, synonyms, hypernyms, and meronyms, in a novel analogy task. Although the set of relations tested by Köper et al. (2015) is somewhat more constrained than the set we use, there is a good deal of overlap. However, their evaluation is performed in the context of relational similarity, and they do not perform clustering or classification on the DiffVecs.

3 General Approach and Resources 

We define the task of lexical relation learning to take a set of (ordered) word pairs $\{(w_i, w_j)\}$ and a set of binary lexical relations $R = \{r_k\}$, and map each word pair $(w_i, w_j)$ as follows: (a) $(w_i, w_j) \mapsto r_k \in R$, i.e. the “closed-world” setting, where we assume that all word pairs can be uniquely classified according to a relation in $R$; or (b) $(w_i, w_j) \mapsto r_k \in R \cup \{\phi\}$, where $\phi$ signifies the fact that none of the relations in $R$ apply to the word pair in question, i.e. the “open-world” setting.

Our starting point for lexical relation learning is the assumption that important information about various types of relations is implicitly embedded in the offset vectors. While a range of methods have been proposed for composing word vectors (Baroni et al., 2012; Weeds et al., 2014; Roller et al., 2014), in this research we focus exclusively on DiffVec (i.e. $w_2 - w_1$). A second assumption is that there exist dimensions, or directions, in the embedding vector spaces responsible for a particular lexical relation. Such dimensions could be identified and exploited as part of a clustering or classification method, in the context of identifying relations between word pairs or classes of DiffVecs.

In order to test the generalisability of the DiffVec method, we require: (1) word embeddings, and (2) a set of lexical relations to evaluate against. As the focus of this paper is not the word embedding pre-training approaches so much as the utility of the DiffVecs for lexical relation learning, we take a selection of four pre-trained word embeddings with strong currency in the literature, as detailed in §3.1. We also include the state-of-the-art count-based approach of Levy et al. (2015a), to test the generalisability of DiffVecs to count-based word embeddings.

For the lexical relations, we want a range of relations that is representative of the types of relational learning tasks targeted in the literature, and for which annotated data is available. To this end, we construct a dataset from a variety of sources, focusing on lexical semantic relations (which are less well represented in the analogy dataset of Mikolov et al. (2013c)), but also including morphosyntactic and morphosemantic relations (see §3.2).


Name          Dimensions   Training data
--------------------------------------------
w2v           300          100 × 10^9
GloVe         200          6 × 10^9
SENNA         100          37 × 10^6
HLBL          200          37 × 10^6
--------------------------------------------
w2v_wiki      300          50 × 10^6
GloVe_wiki    300          50 × 10^6
SVD_wiki      300          50 × 10^6


Table 1: The pre-trained word embeddings used in 
our experiments, with the number of dimensions 
and size of the training data (in word tokens). The 
models trained on English Wikipedia (“wiki”) are 
in the lower half of the table. 


3.1 Word Embeddings 

We consider four highly successful word embedding models in our experiments: w2v (Mikolov et al., 2013a; Mikolov et al., 2013b), GloVe (Pennington et al., 2014), SENNA (Collobert and Weston, 2008), and HLBL (Mnih and Hinton, 2009), as detailed below. We also include SVD (Levy et al., 2015a), a count-based model which factorises a positive PMI (PPMI) matrix. For consistency of comparison, we train SVD as well as a version of w2v and GloVe (which we call w2v_wiki and GloVe_wiki, respectively) on the English Wikipedia corpus (comparable in size to the training data of SENNA and HLBL), and apply the preprocessing of Levy et al. (2015a). We additionally normalise the w2v_wiki and SVD_wiki vectors to unit length; GloVe_wiki is natively normalised by column.¹
w2v CBOW (Continuous Bag-Of-Words; Mikolov et al. (2013a)) predicts a word from its context using a model with the objective:

$$J = \frac{1}{T} \sum_{i=1}^{T} \log \frac{\exp\left( w_i^\top \sum_{j \in [-c,+c],\, j \neq 0} \tilde{w}_{i+j} \right)}{\sum_{k=1}^{V} \exp\left( w_k^\top \sum_{j \in [-c,+c],\, j \neq 0} \tilde{w}_{i+j} \right)}$$

where $w_i$ and $\tilde{w}_i$ are the vector representations for the $i$th word (as a focus or context word, respectively), $V$ is the vocabulary size, $T$ is the number of tokens in the corpus, and $c$ is the context window size.² Google News data was used

¹ We ran a series of experiments on normalised and unnormalised w2v models, and found that normalisation tends to boost results over most of our relations (with the exception of LEXSEM_Event and NOUN_Coll). We leave a more detailed investigation of normalisation to future work.

² In a slight abuse of notation, the subscripts of $w$ do double



Relation       Description                                          Pairs   Source                Example
----------------------------------------------------------------------------------------------------------------
LEXSEM_Hyper   hypernym                                             1173    SemEval’12 + BLESS    (animal, dog)
LEXSEM_Mero    meronym                                              2825    SemEval’12 + BLESS    (airplane, cockpit)
LEXSEM_Attr    characteristic quality, action                       71      SemEval’12            (cloud, rain)
LEXSEM_Cause   cause, purpose, or goal                              249     SemEval’12            (cook, eat)
LEXSEM_Space   location or time association                         235     SemEval’12            (aquarium, fish)
LEXSEM_Ref     expression or representation                         187     SemEval’12            (song, emotion)
LEXSEM_Event   object’s action                                      3583    BLESS                 (zip, coat)
NOUN_SP        plural form of a noun                                100     MSR                   (year, years)
VERB_3         first to third person verb present-tense form        99      MSR                   (accept, accepts)
VERB_Past      present-tense to past-tense verb form                100     MSR                   (know, knew)
VERB_3Past     third person present-tense to past-tense verb form   100     MSR                   (creates, created)
LVC            light verb construction                              58      Tan et al. (2006b)    (give, approval)
VERBNOUN       nominalisation of a verb                             3303    WordNet               (approve, approval)
PREFIX         prefixing with re morpheme                           118     Wiktionary            (vote, revote)
NOUN_Coll      collective noun                                      257     Web source            (army, ants)


Table 2: Description of the 15 lexical relations. 


to train the model. We use the focus word vectors, $W = \{w_k\}_{k=1}^{V}$, normalised such that each $\|w_k\| = 1$.

The GloVe model (Pennington et al., 2014) is based on a similar bilinear formulation, framed as a low-rank decomposition of the matrix of corpus co-occurrence frequencies:

$$J = \frac{1}{2} \sum_{i,j=1}^{V} f(P_{ij}) \left( w_i^\top \tilde{w}_j - \log P_{ij} \right)^2 ,$$

where $w_i$ is a vector for the left context, $\tilde{w}_j$ is a vector for the right context, $P_{ij}$ is the relative frequency of word $j$ in the context of word $i$, and $f$ is a heuristic weighting function to balance the influence of high versus low term frequencies. The model was trained on English Wikipedia and the English Gigaword corpus version 5.

The SVD model (Levy et al., 2015a) uses a positive pointwise mutual information (PPMI) matrix defined as:

$$\mathrm{PPMI}(w, c) = \max\left( \log \frac{P(w, c)}{P(w)\,P(c)},\; 0 \right),$$

where $P(w, c)$ is the joint probability of word $w$ and context $c$, and $P(w)$ and $P(c)$ are their marginal probabilities. The matrix is factorised by singular value decomposition.
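As a rough illustration of the PPMI definition above, the following sketch computes PPMI values from a toy table of co-occurrence counts using maximum-likelihood probability estimates. The function name `ppmi` and the counts are hypothetical, and the sketch deliberately leaves out the hyperparameters of Levy et al. (2015a), such as context distribution smoothing, as well as the factorisation step itself.

```python
import math
from collections import defaultdict

def ppmi(counts):
    """Compute PPMI for every observed (word, context) pair.

    counts: dict mapping (word, context) -> co-occurrence count.
    Pairs that never co-occur are simply absent (their PPMI is 0).
    """
    total = sum(counts.values())
    word_totals = defaultdict(float)
    context_totals = defaultdict(float)
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n

    result = {}
    for (w, c), n in counts.items():
        if n == 0:
            continue
        p_wc = n / total
        p_w = word_totals[w] / total
        p_c = context_totals[c] / total
        # PMI is clipped at zero to obtain positive PMI.
        result[(w, c)] = max(math.log(p_wc / (p_w * p_c)), 0.0)
    return result

# Hypothetical toy counts: "the" co-occurs with everything, so it is uninformative.
toy_counts = {
    ("dog", "bark"): 4, ("dog", "the"): 4,
    ("cat", "meow"): 4, ("cat", "the"): 4,
}
toy_ppmi = ppmi(toy_counts)
```

A dense version of the resulting matrix could then be passed to any SVD routine to obtain low-dimensional word vectors.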

HLBL (Mnih and Hinton, 2009) is a log-bilinear formulation of an n-gram language model, which predicts the $i$th word based on context words $(i-n, \ldots, i-2, i-1)$. This leads to the following

duty, denoting either the embedding for the $i$th token, $w_i$, or $k$th word type, $w_k$.


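training objective:

$$J = \frac{1}{T} \sum_{i=1}^{T} \log \frac{\exp\left( \tilde{w}_i^\top w_i + b_i \right)}{\sum_{k=1}^{V} \exp\left( \tilde{w}_i^\top w_k + b_k \right)} ,$$

where $\tilde{w}_i = \sum_{j=1}^{n} C_j w_{i-j}$ is the context embedding, $C_j$ is a scaling matrix, and $b_*$ is a bias term.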

The final model, SENNA (Collobert and Weston, 2008), was initially proposed for multi-task training of several language processing tasks, from language modelling through to semantic role labelling. Here we focus on the statistical language modelling component, which has a pairwise ranking objective to maximise the relative score of each word in its local context:

$$J = \frac{1}{T} \sum_{i=1}^{T} \sum_{k=1}^{V} \max\left[ 0,\; 1 - f(w_{i-c}, \ldots, w_{i-1}, w_i) + f(w_{i-c}, \ldots, w_{i-1}, w_k) \right] ,$$

where the last $c - 1$ words are used as context, and $f(x)$ is a non-linear function of the input, defined as a multi-layer perceptron.

For HLBL and SENNA, we use the pre-trained embeddings from Turian et al. (2010), trained on the Reuters English newswire corpus. In both cases, the embeddings were scaled by the global standard deviation over the word-embedding matrix, $W_{\text{scaled}} = 0.1 \times W / \sigma(W)$.
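The scaling by the global standard deviation can be sketched as follows. This is a minimal illustration rather than the code used to produce the released embeddings: the function name `scale_embeddings` is hypothetical, and the use of the population standard deviation over all matrix entries is an assumption.

```python
import statistics

def scale_embeddings(matrix, target_std=0.1):
    """Rescale an embedding matrix so that its entries have a global
    standard deviation of target_std.

    matrix: list of rows (one row per word), each a list of floats.
    Assumption: the population standard deviation over all entries is used.
    """
    values = [x for row in matrix for x in row]
    sigma = statistics.pstdev(values)
    return [[target_std * x / sigma for x in row] for row in matrix]

# Hypothetical 2 x 2 embedding matrix.
W = [[1.0, -1.0], [2.0, -2.0]]
W_scaled = scale_embeddings(W)
```

After scaling, the relative geometry of the vectors is unchanged, since every entry is multiplied by the same constant.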

For w2v_wiki, GloVe_wiki and SVD_wiki we used English Wikipedia. We followed the same preprocessing procedure described in Levy et al. (2015a),³ i.e., lower-cased all words and removed non-textual elements. During the training phase, for each model

³ Although the w2v model trained without preprocessing performed marginally better, we used preprocessing throughout for consistency.



we set a word frequency threshold of 5. For the SVD model, we followed the recommendations of Levy et al. (2015a) in setting the context window size to 2, negative sampling parameter to 1, eigenvalue weighting to 0.5, and context distribution smoothing to 0.75; other parameters were assigned their default values. For the other models we used the following parameter values: for w2v, context window = 8, negative samples = 25, hs = 0, sample = 1e-4, and iterations = 15; and for GloVe, context window = 15, x_max = 10, and iterations = 15.

3.2 Lexical Relations 

In order to evaluate the applicability of the DiffVec approach to relations of different types, we assembled a set of lexical relations in three broad categories: lexical semantic relations, morphosyntactic paradigm relations, and morphosemantic relations. We constrained the relations to be binary and to have fixed directionality.⁴ Consequently we excluded symmetric lexical relations such as synonymy. We additionally constrained the dataset to the words occurring in all embedding sets. There is some overlap between our relations and those included in the analogy task of Mikolov et al. (2013c), but we include a much wider range of lexical semantic relations, especially those standardly evaluated in the relation classification literature. We manually filtered the data to remove duplicates (e.g., as part of merging the two sources of LEXSEM_Hyper instances), and normalised directionality.

The final dataset consists of 12,458 triples (relation, word_1, word_2), comprising 15 relation types, extracted from SemEval’12 (Jurgens et al., 2012), BLESS (Baroni and Lenci, 2011), the MSR analogy dataset (Mikolov et al., 2013c), the light verb dataset of Tan et al. (2006a), Princeton WordNet (Fellbaum, 1998), Wiktionary,⁵ and a web lexicon of collective nouns,⁶ as listed in Table 2.⁷

4 Clustering 

Assuming DiffVecs are capable of capturing all lexical relations equally, we would expect clustering to be able to identify sets of word pairs with

⁴ Word similarity is not included; it is not easily captured by DiffVec since there is no homogeneous “content” to the lexical relation which could be captured by the direction and magnitude of a difference vector (other than that it should be small).

⁵ http://en.wiktionary.org

⁶ http://www.rinkworks.com/words/collective.shtml

⁷ The dataset is available at http://github.com/ivri/DiffVec

Figure 1: t-SNE projection (Van der Maaten and Hinton, 2008) of DiffVecs for 10 sample word pairs of each relation type, based on w2v. The intersection of the two axes identifies the projection of the zero vector. Best viewed in colour.

high relational similarity, or equivalently clusters of similar offset vectors. Under the additional assumption that a given word pair corresponds to a unique lexical relation (in line with our definition of the lexical relation learning task in §3), a hard clustering approach is appropriate. In order to test these assumptions, we cluster our 15-relation closed-world dataset in the first instance, and evaluate against the lexical resources in §3.2.

As further motivation, we projected the DiffVec space for a small number of samples of each class using t-SNE (Van der Maaten and Hinton, 2008), and found that many of the morphosyntactic relations (VERB_3, VERB_Past, VERB_3Past, NOUN_SP) form tight clusters (Figure 1).

We cluster the DiffVecs between all word pairs in our dataset using spectral clustering (Von Luxburg, 2007). Spectral clustering has two hyperparameters: the number of clusters, and the pairwise similarity measure for comparing DiffVecs. We tune the hyperparameters over development data, in the form of 15% of the data obtained by random sampling, selecting the configuration that maximises the V-Measure (Rosenberg and Hirschberg, 2007). Figure 2 presents V-Measure values over the test data for each of the four word embedding models. We show results for different numbers of clusters, from N = 10 in steps of 10, up to N = 80 (beyond which the clustering quality diminishes).⁸ Observe that w2v achieves the best

⁸ Although 80 clusters is far more than our 15 relation types, the SemEval’12 classes each contain numerous subclasses, so the larger number may be more realistic.

Figure 2: Spectral clustering results, comparing cluster quality (V-Measure) and the number of clusters. DiffVecs are clustered and compared to the known relation types. Each line shows a different source of word embeddings.

results, with a V-Measure value of around 0.36,⁹ which is relatively constant over varying numbers of clusters. GloVe and SVD mirror this result, but are consistently below w2v at a V-Measure of around 0.31. HLBL and SENNA performed very similarly, at a substantially lower V-Measure than w2v or GloVe, closer to 0.21. As a crude calibration for these results, over the related clustering task of word sense induction, the best-performing systems in SemEval-2010 Task 14 (Manandhar et al., 2010) achieved a V-Measure of under 0.2.

The lower V-Measure for w2v_wiki and GloVe_wiki (as compared to w2v and GloVe, respectively) indicates that the volume of training data plays a role in the clustering results. However, both methods still perform well above SENNA and HLBL, and w2v has a clear empirical advantage over GloVe. We note that SVD_wiki performs almost as well as w2v_wiki, consistent with the results of Levy et al. (2015a).
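For readers unfamiliar with the evaluation metric, V-Measure (Rosenberg and Hirschberg, 2007) is the harmonic mean of homogeneity and completeness, both defined through conditional entropies between gold classes and induced clusters. The sketch below is a small self-contained implementation with equal weighting of the two components; the function names are hypothetical and it is not the evaluation code used in the experiments.

```python
import math
from collections import Counter

def _entropy(labels):
    """Entropy H(labels) under maximum-likelihood estimates."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def _cond_entropy(target, given):
    """Conditional entropy H(target | given)."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((k / n) * math.log(k / given_counts[g])
                for (_, g), k in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - _cond_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - _cond_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

The score depends only on how instances are grouped, not on the cluster identifiers, so a clustering that matches the gold relations up to renaming of clusters receives the maximum score.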

⁹ V-Measure returns a value in the range [0, 1], with 1 indicating perfect homogeneity and completeness.

We additionally calculated the entropy for each lexical relation, based on the distribution of instances belonging to a given relation across the different clusters (and simple MLE). For each embedding method, we present the entropy for the cluster size where V-Measure was maximised over the development data. Since the samples are distributed nonuniformly, we normalise entropy results for each method by log(n) where n is the number of samples in a particular relation. The results are in Table 3, with the lowest entropy (purest clustering) for each relation marked with an asterisk.

Relation       w2v      GloVe    HLBL     SENNA
--------------------------------------------------
LEXSEM_Attr    0.49*    0.54     0.62     0.63
LEXSEM_Cause   0.47*    0.53     0.56     0.57
LEXSEM_Space   0.49*    0.55     0.54     0.58
LEXSEM_Ref     0.44*    0.50     0.54     0.56
LEXSEM_Hyper   0.44     0.50     0.43*    0.45
LEXSEM_Event   0.46*    0.47     0.47     0.48
LEXSEM_Mero    0.40*    0.42     0.42     0.43
NOUN_SP        0.07*    0.14     0.22     0.29
VERB_3         0.05*    0.06     0.49     0.44
VERB_Past      0.09*    0.14     0.38     0.35
VERB_3Past     0.07     0.05*    0.49     0.52
LVC            0.28*    0.55     0.32     0.30
VERBNOUN       0.31*    0.33     0.35     0.36
PREFIX         0.32     0.30*    0.55     0.58
NOUN_Coll      0.21*    0.27     0.46     0.44

Table 3: The entropy for each lexical relation over the clustering output for each set of pre-trained word embeddings.
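The normalised per-relation entropy described above can be sketched as follows. The function name `normalised_relation_entropy` is hypothetical, and the handling of relations with a single instance (returning 0) is an assumption made to avoid dividing by log(1).

```python
import math
from collections import Counter

def normalised_relation_entropy(cluster_ids):
    """Entropy of one relation's instances over clusters, divided by log(n).

    cluster_ids: the cluster assigned to each instance of a single relation.
    Returns a value in [0, 1]: 0 when all instances share one cluster,
    1 when every instance falls in its own cluster.
    """
    n = len(cluster_ids)
    if n <= 1:
        return 0.0  # assumption: a single instance is treated as perfectly pure
    entropy = -sum((k / n) * math.log(k / n)
                   for k in Counter(cluster_ids).values())
    return entropy / math.log(n)
```

Dividing by log(n) makes relations of very different sizes (e.g. 58 versus 3583 pairs) comparable, since log(n) is the largest entropy that n instances can attain.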

Looking across the different lexical relation types, the morphosyntactic paradigm relations (NOUN_SP and the three VERB relations) are by far the easiest to capture. The lexical semantic relations, on the other hand, are the hardest to capture for all embeddings.

Considering w2v embeddings, for VERB_3 there was a single cluster consisting of around 90% of VERB_3 word pairs. Most errors resulted from POS ambiguity, leading to confusion with VERBNOUN in particular. Example VERB_3 pairs incorrectly clustered are: (study, studies), (run, runs), and (like, likes). This polysemy results in the distance represented in the DiffVec for such pairs being above average for VERB_3, and in the pairs consequently being clustered with other cross-POS relations.

For VERB_Past, a single relatively pure cluster was generated, with minor contamination due to pairs such as (hurt, saw), (utensil, saw), and (wipe, saw). Here, the noun saw is ambiguous with a high-frequency past-tense verb; hurt and wipe also have ambiguous POS.

A related phenomenon was observed for NOUN_Coll, where the instances were assigned to a large mixed cluster containing word pairs where the second word referred to an animal, reflecting the fact that most of the collective nouns in our dataset relate to animals, e.g. (stand, horse), (ambush, tigers), (antibiotics, bacteria). This is interesting from a DiffVec point of view, since it shows that the lexical semantics of one word in the pair can overwhelm the semantic content of the DiffVec (something that we return to investigate in §5.4). LEXSEM_Mero was also split into multiple clusters along topical lines, with separate clusters for weapons, dwellings, vehicles, etc.

Given the encouraging results from our clustering experiment, we next evaluate DiffVecs in a supervised relation classification setting.

5 Classification 

A natural question is whether we can accurately characterise lexical relations through supervised learning over the DiffVecs. For these experiments we use the w2v, w2v_wiki, and SVD_wiki embeddings exclusively (based on their superior performance in the clustering experiment), and a subset of the relations which is both representative of the breadth of the full relation set, and for which we have sufficient data for supervised training and evaluation, namely: NOUN_Coll, LEXSEM_Event, LEXSEM_Hyper, LEXSEM_Mero, NOUN_SP, PREFIX, VERB_3, VERB_3Past, and VERB_Past (see Table 2).

We consider two applications: (1) a Closed-World setting similar to the unsupervised evaluation, in which the classifier only encounters word pairs which correspond to one of the nine relations; and (2) a more challenging Open-World setting where random word pairs (which may or may not correspond to one of our relations) are included in the evaluation. For both settings, we further investigate whether there is a lexical memorisation effect for a broad range of relation types of the sort identified by Weeds et al. (2014) and Levy et al. (2015b) for hypernyms, by experimenting with disjoint training and test vocabulary.

5.1 Closed-World Classification

For the Closed-World setting, we train and test a multiclass classifier on datasets comprising (DiffVec, r) pairs, where r is one of our nine relation types, and DiffVec is based on one of w2v, w2v_wiki and SVD_wiki. As a baseline, we cluster the data as described in §4, running the clusterer several times over the 9-relation data to select the optimal V-Measure value based on the development data, resulting in 50 clusters. We label each cluster with the majority class based on the training instances, and evaluate the resultant labelling for the test instances.
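The majority-class labelling step of the baseline can be sketched as below. The function names and the toy cluster assignments are hypothetical, and the treatment of test clusters that contain no training instances (they receive no label) is an assumption not specified in the text.

```python
from collections import Counter, defaultdict

def majority_labels(train_clusters, train_relations):
    """Map each cluster id to the most frequent relation among its training instances."""
    votes = defaultdict(Counter)
    for cluster, relation in zip(train_clusters, train_relations):
        votes[cluster][relation] += 1
    return {cluster: counter.most_common(1)[0][0]
            for cluster, counter in votes.items()}

def label_test(test_clusters, cluster_to_relation, default=None):
    """Label each test instance with the majority relation of its cluster.

    Assumption: clusters without any training instance get `default`.
    """
    return [cluster_to_relation.get(c, default) for c in test_clusters]

# Hypothetical toy assignment of five training DiffVecs to two clusters.
mapping = majority_labels(
    [0, 0, 0, 1, 1],
    ["VERB_3", "VERB_3", "VERBNOUN", "NOUN_SP", "NOUN_SP"],
)
predicted = label_test([1, 0, 2], mapping)
```

Under this scheme, a relation can only be predicted if it is the majority class of at least one cluster.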

We use an SVM with a linear kernel, and report results from 10-fold cross-validation in Table 4.

The SVM achieves a higher F-score than the baseline on almost every relation, particularly on LEXSEM_Hyper, and the lower-frequency NOUN_SP, NOUN_Coll, and PREFIX. Most of the relations, even the most difficult ones from our clustering experiment, are classified with very high F-score. That is, with a simple linear transformation of the embedding dimensions, we are able to achieve near-perfect results. The PREFIX relation achieved markedly lower recall, resulting in a lower F-score, due to large differences in the predominant usages associated with the respective words (e.g., (union, reunion), where the vector for union is heavily biased by contexts associated with trade unions, but reunion is heavily biased by contexts relating to social get-togethers; and (entry, reentry), where entry is associated with competitions and entrance to schools, while reentry is associated with space travel). Somewhat surprisingly, given the small dimensionality of the input (vectors of size 300 for all three methods), we found that the linear SVM slightly outperformed a non-linear SVM using an RBF kernel. We observe no real difference between w2v_wiki and SVD_wiki, supporting the hypothesis of Levy et al. (2015a) that under appropriate parameter settings, count-based methods achieve high results. The impact of the training data volume for pre-training of the embeddings is also less pronounced than in the case of our clustering experiment.

Relation        Baseline   w2v     w2v_wiki   SVD_wiki
---------------------------------------------------------
LEXSEM_Hyper    0.60       0.93    0.91       0.91
LEXSEM_Mero     0.90       0.97    0.96       0.96
LEXSEM_Event    0.87       0.98    0.97       0.97
NOUN_SP         0.00       0.83    0.78       0.74
VERB_3          0.99       0.98    0.96       0.97
VERB_Past       0.78       0.98    0.98       0.95
VERB_3Past      0.99       0.98    0.98       0.96
PREFIX          0.00       0.82    0.34       0.60
NOUN_Coll       0.19       0.95    0.91       0.92
---------------------------------------------------------
Micro-average   0.84       0.97    0.95       0.95

Table 4: F-scores for Closed-World classification, for a baseline method based on clustering + majority-class labelling, and a multiclass linear SVM trained on w2v, w2v_wiki and SVD_wiki DiffVec inputs.

5.2 Open-World Classification 

We now turn to a more challenging evaluation setting: a test set including word pairs drawn at random. This setting aims to illustrate whether a DiffVec-based classifier is capable of differentiating related word pairs from noise, and can be applied to open data to learn new related word pairs.¹⁰

¹⁰ Hereafter we provide results for w2v only, as we found that SVD achieved similar results.

                      Orig                    +neg
Relation        P      R      F         P      R      F
-----------------------------------------------------------
LEXSEM_Hyper    0.95   0.92   0.93      0.99   0.84   0.91
LEXSEM_Mero     0.13   0.96   0.24      0.95   0.84   0.89
LEXSEM_Event    0.44   0.98   0.61      0.93   0.90   0.91
NOUN_SP         0.95   0.68   0.80      1.00   0.68   0.81
VERB_3          0.75   1.00   0.86      0.93   0.93   0.93
VERB_Past       0.94   0.86   0.90      0.97   0.84   0.90
VERB_3Past      0.76   0.95   0.84      0.87   0.93   0.90
PREFIX          1.00   0.29   0.44      1.00   0.13   0.23
NOUN_Coll       0.43   0.74   0.55      0.97   0.41   0.57

Table 5: Precision (P) and recall (R) for Open-World classification, using the binary classifier without (“Orig”) and with (“+neg”) negative samples.

For these experiments, we train a binary classifier for each relation type, using 2/3 of our relation data for training and 1/3 for testing. The test data is augmented with an equal quantity of random pairs, generated as follows:

(1) sample a seed lexicon by drawing words pro¬ 
portional to their frequency in Wikipedia; 

(2) take the Cartesian product over pairs of words 
from the seed lexicon; 

(3) sample word pairs uniformly from this set. 
This procedure generates word pairs that are repre¬ 
sentative of the frequency profile of our corpus. 
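
The three sampling steps above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name, the `word_freq` dictionary, the `lexicon_size` parameter, drawing the lexicon with replacement, and skipping pairs of identical words are all assumptions made for the example.

```python
import itertools
import random


def sample_random_pairs(word_freq, lexicon_size, n_pairs, seed=0):
    """Sample random word pairs following steps (1)-(3).

    word_freq: dict mapping each word to its corpus frequency
    (hypothetical input; assumed to contain only words with embeddings).
    """
    rng = random.Random(seed)
    words = list(word_freq)
    weights = [word_freq[w] for w in words]

    # (1) Draw a seed lexicon with probability proportional to frequency.
    #     Drawing is with replacement and duplicates are collapsed, so the
    #     lexicon may end up smaller than lexicon_size (an assumption).
    lexicon = sorted(set(rng.choices(words, weights=weights, k=lexicon_size)))

    # (2) Cartesian product over the lexicon; (w, w) pairs are skipped
    #     (also an assumption, not stated in the text).
    candidates = [(a, b)
                  for a, b in itertools.product(lexicon, repeat=2)
                  if a != b]

    # (3) Sample uniformly, without replacement, from the candidate set.
    return rng.sample(candidates, min(n_pairs, len(candidates)))
```

Because frequent words dominate step (1), the resulting pairs tend to follow the frequency profile of the corpus, which is the property the procedure is designed to preserve.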

We train 9 binary RBF-kernel SVM classifiers on the training
partition, and evaluate on our randomly augmented test set.
Fully annotating our random word pairs is prohibitively
expensive, so instead, we manually annotated only the word
pairs which were positively classified by one of our models.
The results of our experiments are presented in the left half
of Table 5, in which we report on results over the combination
of the original test data from §5.1 and the random word pairs,
noting that recall (R) for Open-World takes the form of
relative recall (Pantel et al., 2004) over the
positively-classified word pairs. The results are much lower
than for the Closed-World setting (Table 4), most notably in
terms of precision (P). For instance, the random pairs
(have, works), (turn, took), and (works, started) were
incorrectly classified as VERB_3, VERB_Past and VERB_3Past,
respectively. That is, the model captures syntax, but lacks the
ability to capture lexical paradigms, and tends to overgenerate.
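
One way to read the evaluation above is sketched below: precision is computed over everything the classifier labels positive, while recall is computed against a pool made up of the known test positives plus the random pairs that annotators judged to be valid. This is an illustrative interpretation only; the function and argument names are hypothetical, and the exact formulation of relative recall in Pantel et al. (2004) may differ.

```python
def precision_and_relative_recall(predicted, gold_test, judged_correct):
    """Score one binary relation classifier.

    predicted:      set of word pairs the classifier labelled positive
    gold_test:      set of known positive pairs from the original test data
    judged_correct: set of random pairs manually judged to be valid
                    (only positively-classified pairs were annotated)
    """
    # Pool of pairs known to be correct; random pairs that no model
    # predicted were never annotated, so recall is relative to this pool.
    pool = gold_test | judged_correct
    true_pos = predicted & pool

    precision = len(true_pos) / len(predicted) if predicted else 0.0
    recall = len(true_pos) / len(pool) if pool else 0.0
    return precision, recall
```

Under this reading, a random pair such as (turn, took) that is predicted positive but judged invalid lowers precision without affecting recall.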

Footnote: Filtered to consist of words for which we have
embeddings.

5.3 Open-World Training with Negative Sampling

To address the problem of incorrectly classifying random word
pairs as valid relations, we retrain the classifier on a
dataset comprising both valid and automatically-generated
negative distractor samples. The basic intuition behind this
approach is to construct samples which will force the model to
learn decision boundaries that more tightly capture the true
scope of a given relation. To this end, we automatically
generated two types of negative distractors:

opposite pairs: generated by switching the order of word
pairs, $\mathrm{Oppos}_{w_1,w_2} = \mathbf{word}_1 - \mathbf{word}_2$.
This ensures the classifier adequately captures the asymmetry
in the relations.

shuffled pairs: generated by replacing $w_2$ with a random
word $w'_2$ from the same relation,
$\mathrm{Shuff}_{w_1,w_2} = \mathbf{word}'_2 - \mathbf{word}_1$.
This is targeted at relations that take specific word classes
in particular positions, e.g., (VB, VBD) word pairs, so that
the model learns to encode the relation rather than simply
learning the properties of the word classes.

Both types of distractors are added to the training set, such
that there are equal numbers of valid relations, opposite
pairs and shuffled pairs.
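
A minimal sketch of how the two distractor types might be generated for a single relation is given below. It assumes that a valid pair (w1, w2) is represented by the difference word_2 - word_1, and that embeddings are stored as plain lists of floats in a dictionary; the function names and the `emb` structure are hypothetical.

```python
import random


def diff(emb, first, second):
    """Difference vector for an ordered pair: emb[second] - emb[first]."""
    return [b - a for a, b in zip(emb[first], emb[second])]


def make_training_vectors(pairs, emb, seed=0):
    """Build valid, opposite and shuffled vectors for one relation."""
    rng = random.Random(seed)
    second_words = sorted({w2 for _, w2 in pairs})
    positives, opposites, shuffled = [], [], []
    for w1, w2 in pairs:
        positives.append(diff(emb, w1, w2))   # word_2 - word_1 (assumed)
        opposites.append(diff(emb, w2, w1))   # word_1 - word_2
        # Replace w2 with a different second word from the same relation.
        # Note: the replacement is not checked against the valid pairs, so
        # it could occasionally produce a true instance of the relation.
        alternatives = [w for w in second_words if w != w2]
        if alternatives:
            w2_new = rng.choice(alternatives)
            shuffled.append(diff(emb, w1, w2_new))  # word'_2 - word_1
    return positives, opposites, shuffled
```

With at least two distinct second words in the relation, the three lists have the same length, matching the equal proportions described above.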

After training our classifier, we evaluate its predictions in
the same way as in §5.2, using the same test set combining
related and random word pairs. The results are shown in the
right half of Table 5 (as “+neg”). Observe that the precision
is much higher and recall somewhat lower compared to the
classifier trained with only positive samples. This follows
from the adversarial training scenario: using negative
distractors results in a more conservative classifier, which
correctly classifies the vast majority of the random word
pairs as not corresponding to a given relation, resulting in
higher precision at the expense of a small drop in recall.
Overall this leads to higher F-scores, as shown in Figure 3,
other than for hypernyms (LEXSEM_Hyper) and prefixes (PREFIX).
For example, the standard classifier for NOUN_Coll learned to
match word pairs including an animal name (e.g.,
(plague, rats)), while training with negative samples resulted
in much more conservative predictions and consequently much
lower recall. The classifier was able to capture
(herd, horses) but not (run, salmon), (party, jays) or
(singular, boar) as instances of NOUN_Coll, possibly because
of polysemy. The most striking difference in performance was
for LEXSEM_Mero, where the standard classifier generated many
false positive noun pairs (e.g. (series, radio)), but the
false positive rate was considerably reduced with negative
sampling.

Footnote: But noting that relative recall for the random word
pairs is based on the pool of positive predictions from both
models.



[Figure 3; legend: Usage of Negative Samples]

Figure 3: F-score for Open-World classification, comparing
models trained with and without negative samples.

5.4 Lexical Memorisation 

Weeds et al. (2014) and Levy et al. (2015b) recently showed
that supervised methods using DiffVecs achieve artificially
high results through “lexical memorisation” over frequent
words associated with the hypernym relation. For example,
(animal, cat), (animal, dog), and (animal, pig) all share the
superclass animal, and the model thus learns to classify as
positive any word pair with animal as the first word.

To address this effect, we follow Levy et al. (2015b) in
splitting our vocabulary into training and test partitions, to
ensure there is no overlap between training and test
vocabulary. We then train classifiers with and without
negative sampling (§5.3), incrementally adding the random word
pairs from §5.2 to the test data (from no random word pairs to
five times the original size of the test data) to investigate
the interaction of negative sampling with greater diversity in
the test set when there is a split vocabulary. The results are
shown in Figure 4.
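
The vocabulary split can be illustrated with the sketch below, which partitions the words first and then keeps only pairs that fall entirely on one side. The function name, the `test_fraction` parameter and the choice to discard pairs that straddle the two partitions are assumptions made for the example; the text does not describe how such pairs are handled.

```python
import random


def split_by_vocabulary(pairs, test_fraction=0.3, seed=0):
    """Split word pairs so that train and test share no vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for pair in pairs for w in pair})
    rng.shuffle(vocab)
    n_test = int(len(vocab) * test_fraction)
    test_vocab = set(vocab[:n_test])

    train, test = [], []
    for w1, w2 in pairs:
        in_test = (w1 in test_vocab, w2 in test_vocab)
        if all(in_test):
            test.append((w1, w2))
        elif not any(in_test):
            train.append((w1, w2))
        # Pairs with one word on each side are dropped (an assumption),
        # since keeping them would leak vocabulary across the split.
    return train, test
```

With such a split, a classifier cannot succeed on a test pair like (animal, pig) merely because it saw animal as the first word during training.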

Observe that the precision for the standard classifier
decreases rapidly as more random word pairs are added to the
test data. In comparison, the precision when negative sampling
is used shows only a small drop-off, indicating that negative
sampling is effective at maintaining precision in an
Open-World setting even when the training and test vocabulary
are disjoint. This benefit comes at the expense of recall,
which is much lower when negative sampling is used (note that
recall stays relatively constant as random word pairs are
added, as the vast majority of them do not correspond to any
relation). At the maximum level of random word pairs in the
test data, the F-score for the negative sampling classifier is
higher than for the standard classifier.

Figure 4: Evaluation of the Open-World model when trained on
split vocabulary, for varying numbers of random word pairs in
the test dataset (expressed as a multiplier relative to the
number of Closed-World test instances).

6 Conclusions 

This paper is the first to test the generalisability of the
vector difference approach across a broad range of lexical
relations (in raw number and also variety). Using clustering
we showed that many types of morphosyntactic and
morphosemantic differences are captured by DiffVecs, but that
lexical semantic relations are captured less well, a finding
which is consistent with previous work (Köper et al., 2015).
In contrast, classification over the DiffVecs works extremely
well in a closed-world setting, showing that dimensions of
DiffVecs encode lexical relations. Classification performs
less well over open data, although with the introduction of
automatically-generated negative samples, the results improve
substantially. Negative sampling also improves classification
when the training and test vocabulary are split to minimise
lexical memorisation. Overall, we conclude that the DiffVec
approach has impressive utility over a broad range of lexical
relations, especially under supervised classification.